44 research outputs found
Beyond Trending Topics: Real-World Event Identification on Twitter
User-contributed messages on social media sites such as Twitter have emerged as powerful, real-time means of information sharing on the Web. These short messages tend to reflect a variety of events in real time, earlier than other social media sites such as Flickr or YouTube, making Twitter particularly well suited as a source of real-time event content. In this paper, we explore approaches for analyzing the stream of Twitter messages to distinguish between messages about real-world events and non-event messages. Our approach relies on a rich family of aggregate statistics of topically similar message clusters, including temporal, social, topical, and Twitter-centric features. Our large-scale experiments over millions of Twitter messages show the effectiveness of our approach for surfacing real-world event content on Twitter.
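The aggregate cluster statistics named above (temporal, social, topical, and Twitter-centric) can be illustrated with a toy sketch. The four features below are illustrative stand-ins for the paper's actual feature set, computed over a cluster of (timestamp, text) messages:

```python
from collections import Counter

def cluster_features(messages):
    """Aggregate statistics over a cluster of (timestamp, text) messages.
    The four features are illustrative stand-ins for the paper's temporal,
    social, Twitter-centric, and topical feature families."""
    times = sorted(t for t, _ in messages)
    texts = [txt for _, txt in messages]
    # Temporal: message rate over the cluster's lifetime (events tend to burst).
    rate = len(messages) / max(times[-1] - times[0], 1)
    # Social: fraction of messages mentioning another user.
    mention_frac = sum("@" in txt for txt in texts) / len(texts)
    # Twitter-centric: fraction of messages carrying a hashtag.
    hashtag_frac = sum("#" in txt for txt in texts) / len(texts)
    # Topical: share of the single most frequent term (high = coherent topic).
    words = Counter(w.lower() for txt in texts for w in txt.split())
    top_term_frac = words.most_common(1)[0][1] / sum(words.values())
    return {"rate": rate, "mention_frac": mention_frac,
            "hashtag_frac": hashtag_frac, "top_term_frac": top_term_frac}
```

A classifier trained on such per-cluster feature vectors could then separate clusters that reflect real-world events from non-event chatter.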
A Genre-based Clustering Approach to Content Extraction
The content of a webpage is usually contained within a small body of text and images, or perhaps several articles on the same page; however, the content may be lost in the clutter (defined as cosmetic features such as animations, menus, sidebars, obtrusive banners). Automatic content extraction has many applications, including browsing on small cell phone and PDA screens, speech rendering for the visually impaired, and reducing noise for information retrieval systems. We have developed a framework, Crunch, which employs various heuristics for content extraction in the form of filters applied to the webpage's DOM tree; the filters aim to prune or transform the clutter, leaving only the content. Crunch allows users to tune what we call 'settings', consisting of thresholds for applying a particular filter and/or for toggling a filter on/off, because the HTML components that characterize clutter can vary significantly from website to website. However, we have found that the same settings tend to work well across different websites of the same genre, e.g., news or shopping, since the designers often employ similar page layouts. In particular, Crunch could obtain the settings for a previously unknown website by automatically classifying it as sufficiently similar to a cluster of known websites with previously adjusted settings. We present our approach to clustering a large corpus of websites into genres, using their pre-extraction textual material augmented by the snippets generated by searching for the website's domain name in web search engines. Including these snippets increases the frequency of function words needed for clustering. We use the existing Manhattan distance measure and hierarchical clustering techniques, with some modifications, to pre-classify the corpus into genres offline.
Our method does not require prior knowledge of the set of genres that websites fit into, but to be useful, a priori settings must be available for some member of each cluster or a nearby cluster (otherwise defaults are used). Crunch classifies newly encountered websites online in linear time, and then applies the corresponding filter settings, with no noticeable delay added by our content-extracting web proxy.
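The offline clustering step described above can be sketched minimally: build a normalized function-word frequency vector per website, then merge clusters greedily under Manhattan distance. The fixed vocabulary and the naive single-linkage merge rule are simplifying assumptions; the paper's modifications to hierarchical clustering are not reproduced here:

```python
def function_word_vector(text, vocab):
    """Normalized frequency vector over a fixed function-word vocabulary
    (the vocabulary choice here is an illustrative assumption)."""
    words = text.lower().split()
    total = max(len(words), 1)
    return [words.count(w) / total for w in vocab]

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def agglomerative(vectors, threshold):
    """Naive single-linkage hierarchical clustering: repeatedly merge the two
    closest clusters until the closest pair is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(manhattan(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a].extend(clusters.pop(b))
    return clusters
```

A new website would then be assigned online to the nearest existing cluster, inheriting that cluster's filter settings.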
Genre Classification of Websites Using Search Engine Snippets
Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from actual content. Automatic extraction of 'useful and relevant' content from web pages has many applications, including browsing on small cell phone and PDA screens, speech rendering for the visually impaired, and reducing noise for information retrieval systems. Prior work has led to the development of Crunch, a framework which employs various heuristics in the form of filters and filter settings for content extraction. Crunch allows users to tune these settings, essentially the thresholds for applying each filter. However, in order to reduce human involvement in selecting these heuristic settings, we have extended this work to utilize a website's classification, defined by its genre and physical layout. In particular, Crunch would then obtain the settings for a previously unknown website by automatically classifying it as sufficiently similar to a cluster of known websites with previously adjusted settings - which in practice produces better content extraction results than a single one-size-fits-all set of setting defaults. In this paper, we present our approach to clustering a large corpus of websites by their genre, utilizing the snippets generated by sending the website's domain name to search engines as well as the website's own text. We find that exploiting these snippets not only increases the frequency of function words that directly assist in detecting the genre of a website, but also allows for easier clustering of websites. We use existing techniques such as the Manhattan distance measure and hierarchical clustering, with some modifications, to pre-classify websites into genres. Our clustering method does not require prior knowledge of the set of genres that websites fit into, but instead discovers these relationships among websites.
Subsequently, we are able to classify newly encountered websites in linear time, and then apply the corresponding filter settings, with no noticeable delay introduced by the content-extracting web proxy.
SONAR: Automatic detection of cyber security events over the Twitter stream
© 2017 ACM. Every day, security experts face a growing number of security events that affect people's well-being, their information systems, and sometimes critical infrastructure. The sooner they can detect and understand these threats, the more effectively they can mitigate and forensically investigate them. Therefore, they need situation awareness of existing security events and their possible effects. However, given the large number of events, it can be difficult for security analysts and researchers to handle this flow of information adequately and answer the following questions in near real time: What are the current security events? How long do they last? In this paper, we address these questions by leveraging social networks, which contain a massive amount of valuable information on many topics. However, because of the very high volume, extracting meaningful information can be challenging. For this reason, we propose SONAR: an automatic, self-learned framework that can detect, geolocate, and categorize cyber security events in near real time over the Twitter stream. SONAR is based on a taxonomy of cyber security events and a set of seed keywords describing the types of events that we want to follow in order to start detecting events. Using these seed keywords, it automatically discovers new relevant keywords, such as malware names, to extend the range of detection while staying in the same domain. Using a custom taxonomy describing all types of cyber threats, we demonstrate the capabilities of SONAR on a dataset of approximately 47.8 million tweets related to cyber security over the last 9 months. SONAR could efficiently and effectively detect, categorize, and monitor cyber security related events before they appeared in the security news, and it could automatically discover new security terminology associated with those events. Additionally, SONAR is highly scalable and customizable by design; therefore, we could adapt the SONAR framework for virtually any type of event that experts are interested in.
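The self-learning keyword discovery idea can be sketched roughly as one round of co-occurrence-based expansion: collect terms that co-occur with the seed keywords in matching tweets and promote the most frequent ones. The frequency-based scoring here is a deliberate simplification of whatever learning SONAR actually performs:

```python
from collections import Counter

def expand_keywords(tweets, seeds, top_n=3):
    """One round of co-occurrence-based keyword expansion (a simplified
    stand-in for SONAR's self-learned discovery of terms such as malware
    names): count terms appearing alongside the seeds, return the top ones."""
    seeds = {s.lower() for s in seeds}
    cooc = Counter()
    for tweet in tweets:
        words = [w.lower().strip(".,!?") for w in tweet.split()]
        if seeds & set(words):  # tweet matches at least one seed keyword
            # Count co-occurring terms; skip seeds and very short tokens.
            cooc.update(w for w in words if w not in seeds and len(w) > 3)
    return [w for w, _ in cooc.most_common(top_n)]
```

Newly promoted keywords would then feed back into the matching step, widening detection while staying anchored to the seed taxonomy.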
Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media
Social media is often viewed as a sensor into various societal events such as disease outbreaks, protests, and elections. We describe the use of social media as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our approach detects a broad range of cyber-attacks (e.g., distributed denial of service (DDoS) attacks, data breaches, and account hijacking) in an unsupervised manner using just a limited fixed set of seed event triggers. A new query expansion strategy based on convolutional kernels and dependency parses helps model reporting structure and aids in identifying key event characteristics. Through a large-scale analysis over Twitter, we demonstrate that our approach consistently identifies and encodes events, outperforming existing methods.
Real-time Ranking with Concept Drift Using Expert Advice
In many practical applications, one is interested in generating a ranked list of items using information mined from continuous streams of data. For example, in the context of computer networks, one might want to generate lists of nodes ranked according to their susceptibility to attack. In addition, real-world data streams often exhibit concept drift, making the learning task even more challenging. We present an online learning approach to ranking with concept drift, using weighted majority techniques. By continuously modeling different snapshots of the data and tuning our measure of belief in these models over time, we capture changes in the underlying concept and adapt our predictions accordingly. We measure the performance of our algorithm on real electricity data as well as a synthetic data stream, and demonstrate that our approach to ranking from stream data outperforms previously known batch-learning methods and other online methods that do not account for concept drift.
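The weighted majority idea referred to above can be sketched in its classic binary form: each expert is a model snapshot, and an expert's weight decays whenever it errs, so the ensemble's belief drifts toward the snapshots that currently predict well. The ranking-specific extension in the paper is not reproduced here:

```python
def weighted_majority(expert_preds, outcomes, beta=0.5):
    """Classic binary weighted-majority: expert_preds[i][t] is expert i's 0/1
    prediction at step t. Weights are multiplied by `beta` on each mistake,
    which is how the ensemble adapts to concept drift over the stream."""
    n = len(expert_preds)
    weights = [1.0] * n
    predictions = []
    for t, y in enumerate(outcomes):
        # Weighted vote: predict 1 iff experts voting 1 hold >= half the mass.
        vote = sum(w for i, w in enumerate(weights) if expert_preds[i][t] == 1)
        predictions.append(1 if vote >= sum(weights) / 2 else 0)
        # Penalize every expert that got this step wrong.
        for i in range(n):
            if expert_preds[i][t] != y:
                weights[i] *= beta
    return predictions, weights
```

In the streaming setting, each expert would be a model trained on a different snapshot of recent data, and the final ranking would be a weight-weighted combination of the experts' rankings.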
Identification and Characterization of Events in Social Media
Millions of users share their experiences, thoughts, and interests online, through social media sites (e.g., Twitter, Flickr, YouTube). As a result, these sites host a substantial number of user-contributed documents (e.g., textual messages, photographs, videos) for a wide variety of events (e.g., concerts, political demonstrations, earthquakes). In this dissertation, we present techniques for leveraging the wealth of available social media documents to identify and characterize events of different types and scale. By automatically identifying and characterizing events and their associated user-contributed social media documents, we can ultimately offer substantial improvements in browsing and search quality for event content.
To understand the types of events that exist in social media, we first characterize a large set of events using their associated social media documents. Specifically, we develop a taxonomy of events in social media, identify important dimensions along which they can be categorized, and determine the key distinguishing features that can be derived from their associated documents. We quantitatively examine the computed features for different categories of events, and establish that significant differences can be detected across categories. Importantly, we observe differences between events and other non-event content that exists in social media. We use these observations to inform our event identification techniques.
To identify events in social media, we follow two possible scenarios. In one scenario, we do not have any information about the events that are reflected in the data. In this scenario, we use an online clustering framework to identify these unknown events and their associated social media documents. To distinguish between event and non-event content, we develop event classification techniques that rely on a rich family of aggregate cluster statistics, including temporal, social, topical, and platform-centric characteristics. In addition, to tailor the clustering framework to the social media domain, we develop similarity metric learning techniques for social media documents, exploiting the variety of document context features, both textual and non-textual.
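The online clustering framework for the unknown-event scenario can be sketched as a single-pass, threshold-based algorithm: each incoming document joins the most similar existing cluster or starts a new one. Jaccard similarity over word sets stands in here for the learned similarity metric the dissertation describes:

```python
def online_cluster(docs, threshold=0.3):
    """Single-pass incremental clustering sketch: each incoming document joins
    the cluster it most resembles (Jaccard over word sets, a stand-in for the
    learned similarity metric), or starts a new cluster below the threshold."""
    clusters = []  # each cluster: {"words": union of word sets, "docs": [...]}
    for doc in docs:
        words = set(doc.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = len(words & c["words"]) / len(words | c["words"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["docs"].append(doc)
            best["words"] |= words
        else:
            clusters.append({"words": set(words), "docs": [doc]})
    return clusters
```

Each resulting cluster would then be scored by the aggregate statistics described above to decide whether it reflects a real-world event.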
In our alternative event identification scenario, the events of interest are known, through user-contributed event aggregation platforms (e.g., Last.fm events, EventBrite, Facebook events). In this scenario, we can identify social media documents for the known events by exploiting known event features, such as the event title, venue, and time. While this event information is generally helpful and easy to collect, it is often noisy and ambiguous. To address this challenge, we develop query formulation strategies for retrieving event content on different social media sites. Specifically, we propose a two-step query formulation approach, with a first step that uses highly specific queries aimed at achieving high-precision results, and a second step that builds on these high-precision results, using term extraction and frequency analysis, with the goal of improving recall. Importantly, we demonstrate how event-related documents from one social media site can be used to enhance the identification of documents for the event on another social media site, thus contributing to the diversity of information that we identify.
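The two-step query formulation can be sketched as below. The event metadata fields and the frequency-based term scoring are illustrative assumptions, not the dissertation's exact strategies:

```python
from collections import Counter

def precision_query(event):
    """Step 1: a highly specific query built from known event metadata,
    aimed at high-precision results."""
    return f"\"{event['title']}\" {event['venue']} {event['date']}"

def recall_queries(event, high_precision_docs, top_n=2):
    """Step 2: mine frequent terms from the high-precision results and pair
    them with the event title to improve recall (term scoring simplified)."""
    stop = {"the", "a", "at", "in", "on", "and", "of"}
    counts = Counter(w.lower().strip(".,!")
                     for doc in high_precision_docs for w in doc.split()
                     if w.lower() not in stop and len(w) > 3)
    return [f"{event['title']} {term}" for term, _ in counts.most_common(top_n)]
```

The same mined terms could also seed queries against a second social media site, which is how documents from one site help identify the event's documents on another.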
The number of social media documents that our techniques identify for each event is potentially large. To avoid overwhelming users with unmanageable volumes of event information, we design techniques for selecting a subset of documents from the total number of documents that we identify for each event. Specifically, we aim to select high-quality, relevant documents that reflect useful event information. For this content selection task, we experiment with several centrality-based techniques that consider the similarity of each event-related document to the central theme of its associated event and to other social media documents that correspond to the same event. We then evaluate both the relative and overall user satisfaction with the selected social media documents for each event.
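A centrality-based selection technique of the kind described above can be sketched as scoring each document by its average similarity to the other documents for the same event and keeping the top-scoring ones. Jaccard word overlap stands in for the actual similarity measures used:

```python
def select_central(docs, k=1):
    """Centrality-based content selection sketch: rank each document by its
    mean Jaccard word-overlap similarity to the event's other documents
    (a stand-in for the evaluated similarity measures) and return the top k."""
    sets = [set(d.lower().split()) for d in docs]
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    scores = [sum(jaccard(sets[i], sets[j]) for j in range(len(docs)) if j != i)
              / (len(docs) - 1) for i in range(len(docs))]
    ranked = sorted(range(len(docs)), key=lambda i: -scores[i])
    return [docs[i] for i in ranked[:k]]
```

Documents far from the event's central theme, such as off-topic replies, score low and are filtered out of the selection.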
The existing tools to find and organize social media event content are extremely limited. This dissertation presents robust ways to organize and filter this noisy but powerful event information. With our event identification, characterization, and content selection techniques, we provide new opportunities for exploring and interacting with a diverse set of social media documents that reflect timely and revealing event content. Overall, the work presented in this dissertation provides an essential methodology for organizing social media documents that reflect event information, towards improved browsing and search for social media event data.